# NY TAXI DATA SCIENCE


 
## Contents 
#### Insight 1: Passenger Numbers
#### Insight 2: Cash versus Credit 
#### Insight 3: Fare Breakdown
#### Insight 4: Pick-up and Drop-off Locations 
#### Insight 5: Average Fare by Day and Time
#### Insight 6: Busiest City Locations
## Summary
____
**Solutions to** the **bold questions** below are included in this notebook
____
###### Suggested Basic Questions:
1. What are the **distributions of the number of passengers per trip** (see Insight 1), **payment type, fare amount, tip amount, and total amount** (see Insights 2 & 3)?
2. What are the top 5 busiest hours of the day, and the **top 10 busiest locations of the city** (see Insight 6)?
3. What is the **hourly taxi activity for each day of the week** (see Insight 5)?
4. **Which trip has the most consistent fares** (see Insight 2)?
    * Manhattan to JFK Airport (flat fare of \$52)
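
One way to check fare consistency is to compare the spread of `fare_amount` across `RatecodeID` values. The sketch below is illustrative only: it assumes that `RatecodeID` 2 marks the JFK flat rate (as described in the TLC data dictionary), it runs on a tiny made-up sample rather than the real trips, and `fare_spread_by_ratecode` is a hypothetical helper name.

```python
import pandas as pd


def fare_spread_by_ratecode(trips: pd.DataFrame) -> pd.DataFrame:
    """Median, standard deviation and trip count of fare_amount per RatecodeID."""
    return trips.groupby('RatecodeID')['fare_amount'].agg(['median', 'std', 'count'])


# Made-up sample: RatecodeID 1 = standard metered rate, 2 = JFK flat rate (assumption).
sample = pd.DataFrame({
    'RatecodeID': [1, 1, 1, 1, 2, 2, 2],
    'fare_amount': [6.0, 9.5, 18.0, 33.0, 52.0, 52.0, 52.0],
})
spread = fare_spread_by_ratecode(sample)
print(spread)
```

A rate code whose standard deviation is close to zero has the most consistent fares; on the real data the same call would take `df` in place of `sample`.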
###### Suggested Open Questions:
1. Can you predict the fare and tip amount based on the pickup / drop off location, time, and day of the week?
2. Can you predict the pickup / drop off geographical distribution for each hour of a weekday?
3. **If you were a taxi owner, how would you maximize your earnings in a day?**
    * Work the early shift (the data show above-average fares from 3 am until 7 am; see Insight 5)
4. **If you run a taxi company, how would you maximize your earnings?**
    * In short: more data needed!

    Uber is a major market disruptor in the taxi space. To maximise taxi company earnings, it is necessary to discover how old-school taxis can strategically adapt to thrive in current market conditions.

    Data that could help a taxi company maximise its earnings going forward include:
    * Concurrent analysis of Uber versus taxi data
    * Trends within taxi data for the last 2-3 years

    ---

    The data show that most taxis are hailed from busy streets (Insight 4). On cold NY winter mornings (or in the rain!), does Uber now take a big share of the historical taxi market, with pick-up direct from the door rather than a walk to a major route to hail a taxi? The issue was temporarily addressed with UberT, which let riders request a yellow taxi to their door through the Uber app for a \$2 surcharge paid to Uber; the service ended in August 2016.

    ---
    


In [1]:
 
import pandas as pd
import numpy as np
import matplotlib  
import matplotlib.pyplot as plt 
# plotly.plotly is the plotly < 4 location; from plotly 4 on it lives in the separate chart_studio package
import plotly.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly import tools
from IPython.display import Image
from IPython.display import display, Math, Latex 
from IPython.core.display import HTML 
#initiate the Plotly Notebook mode
init_notebook_mode()
df_big = pd.read_csv('../data/yellow_tripdata_2016-01.csv')
# Optional cleaning steps, currently disabled:
#df_big_clean=df_big.fillna(df_big.mean())#df_big.dropna(axis=1)
df_big_clean=df_big
# R equivalent kept for reference (| is the or-operator and ! inverts):
#df_big_clean <- df_big[!(is.na(df$start_pc) | df$start_pc==""), ]
# i.e. keep all rows that are neither a) NA nor b) equal to ""
df=df_big_clean.loc[0:10000,:]  # use a reduced set of rows (labels 0 to 10000) for testing
#df=df_big                      # use the whole month of data
print(df_big.shape)
print(df_big_clean.shape)
df
(2389990, 19)
(2389990, 19)
Out[1]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2016-01-01 00:00:00 2016-01-01 00:00:00 2 1.10 -73.990372 40.734695 1 N -73.981842 40.732407 2 7.5 0.5 0.5 0.00 0.0 0.3 8.80
1 2 2016-01-01 00:00:00 2016-01-01 00:00:00 5 4.90 -73.980782 40.729912 1 N -73.944473 40.716679 1 18.0 0.5 0.5 0.00 0.0 0.3 19.30
2 2 2016-01-01 00:00:00 2016-01-01 00:00:00 1 10.54 -73.984550 40.679565 1 N -73.950272 40.788925 1 33.0 0.5 0.5 0.00 0.0 0.3 34.30
3 2 2016-01-01 00:00:00 2016-01-01 00:00:00 1 4.75 -73.993469 40.718990 1 N -73.962242 40.657333 2 16.5 0.0 0.5 0.00 0.0 0.3 17.30
4 2 2016-01-01 00:00:00 2016-01-01 00:00:00 3 1.76 -73.960625 40.781330 1 N -73.977264 40.758514 2 8.0 0.0 0.5 0.00 0.0 0.3 8.80
5 2 2016-01-01 00:00:00 2016-01-01 00:18:30 2 5.52 -73.980118 40.743050 1 N -73.913490 40.763142 2 19.0 0.5 0.5 0.00 0.0 0.3 20.30
6 2 2016-01-01 00:00:00 2016-01-01 00:26:45 2 7.45 -73.994057 40.719990 1 N -73.966362 40.789871 2 26.0 0.5 0.5 0.00 0.0 0.3 27.30
7 1 2016-01-01 00:00:01 2016-01-01 00:11:55 1 1.20 -73.979424 40.744614 1 N -73.992035 40.753944 2 9.0 0.5 0.5 0.00 0.0 0.3 10.30
8 1 2016-01-01 00:00:02 2016-01-01 00:11:14 1 6.00 -73.947151 40.791046 1 N -73.920769 40.865578 2 18.0 0.5 0.5 0.00 0.0 0.3 19.30
9 2 2016-01-01 00:00:02 2016-01-01 00:11:08 1 3.21 -73.998344 40.723896 1 N -73.995850 40.688400 2 11.5 0.5 0.5 0.00 0.0 0.3 12.80
10 2 2016-01-01 00:00:03 2016-01-01 00:06:19 1 0.79 -74.006149 40.744919 1 N -73.993797 40.741440 2 6.0 0.5 0.5 0.00 0.0 0.3 7.30
11 2 2016-01-01 00:00:03 2016-01-01 00:15:49 6 2.43 -73.969330 40.763538 1 N -73.995689 40.744251 1 12.0 0.5 0.5 3.99 0.0 0.3 17.29
12 2 2016-01-01 00:00:03 2016-01-01 00:00:11 4 0.01 -73.989021 40.721539 1 N -73.988960 40.721699 2 2.5 0.5 0.5 0.00 0.0 0.3 3.80
13 1 2016-01-01 00:00:04 2016-01-01 00:14:32 1 3.70 -74.004303 40.742241 1 N -74.007362 40.706936 1 14.0 0.5 0.5 3.05 0.0 0.3 18.35
14 1 2016-01-01 00:00:05 2016-01-01 00:14:27 2 2.20 -73.991997 40.718578 1 N -74.005135 40.739944 1 11.0 0.5 0.5 1.50 0.0 0.3 13.80
15 2 2016-01-01 00:00:05 2016-01-01 00:07:17 1 0.54 -73.985161 40.768951 1 N -73.990227 40.761730 2 6.0 0.5 0.5 0.00 0.0 0.3 7.30
16 2 2016-01-01 00:00:05 2016-01-01 00:07:14 1 1.92 -73.973091 40.795361 1 N -73.978371 40.773151 2 7.5 0.5 0.5 0.00 0.0 0.3 8.80
17 1 2016-01-01 00:00:06 2016-01-01 00:04:44 1 1.70 -73.982101 40.774696 1 Y -73.970940 40.796707 1 7.0 0.5 0.5 1.65 0.0 0.3 9.95
18 2 2016-01-01 00:00:06 2016-01-01 00:07:14 1 1.38 -73.994843 40.718498 1 N -73.989807 40.734230 1 7.0 0.5 0.5 1.66 0.0 0.3 9.96
19 1 2016-01-01 00:00:07 2016-01-01 00:20:35 2 4.90 -73.953033 40.672115 1 N -73.986572 40.710594 1 19.0 0.5 0.5 4.06 0.0 0.3 24.36
20 1 2016-01-01 00:00:07 2016-01-01 00:09:49 1 1.80 -73.989166 40.726589 1 N -74.009483 40.715073 2 9.0 0.5 0.5 0.00 0.0 0.3 10.30
21 2 2016-01-01 00:00:08 2016-01-01 00:18:51 1 3.09 -73.999069 40.720173 1 N -73.973389 40.756561 2 14.5 0.5 0.5 0.00 0.0 0.3 15.80
22 2 2016-01-01 00:00:08 2016-01-01 00:04:37 1 0.72 -73.997139 40.747219 1 N -74.004486 40.751797 2 5.0 0.5 0.5 0.00 0.0 0.3 6.30
23 2 2016-01-01 00:00:08 2016-01-01 00:03:24 1 0.69 -73.997414 40.736675 1 N -73.985664 40.732681 2 4.5 0.5 0.5 0.00 0.0 0.3 5.80
24 1 2016-01-01 00:00:09 2016-01-01 00:19:03 3 5.30 -73.997131 40.736961 1 N -73.928421 40.755581 1 18.0 0.5 0.5 3.85 0.0 0.3 23.15
25 1 2016-01-01 00:00:09 2016-01-01 00:07:18 2 1.20 -73.963913 40.712173 1 N -73.951332 40.712200 2 7.0 0.5 0.5 0.00 0.0 0.3 8.30
26 2 2016-01-01 00:00:10 2016-01-01 00:06:15 2 0.97 -73.999397 40.743900 1 N -73.988876 40.745319 2 6.0 0.5 0.5 0.00 0.0 0.3 7.30
27 2 2016-01-01 00:00:10 2016-01-01 00:02:20 1 0.87 -73.954407 40.778069 1 N -73.948929 40.788582 2 4.5 0.5 0.5 0.00 0.0 0.3 5.80
28 2 2016-01-01 00:00:12 2016-01-01 00:01:17 1 0.13 -73.991653 40.754559 1 N -73.990601 40.756119 2 3.0 0.5 0.5 0.00 0.0 0.3 4.30
29 1 2016-01-01 00:00:14 2016-01-01 00:13:02 1 2.40 -73.995598 40.744240 1 N -73.985458 40.768711 1 11.0 0.5 0.5 3.05 0.0 0.3 15.35
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9971 2 2016-01-02 01:45:26 2016-01-02 01:51:52 2 2.21 -73.992805 40.747776 1 N -73.986519 40.771732 2 8.5 0.5 0.5 0.00 0.0 0.3 9.80
9972 1 2016-01-02 01:45:27 2016-01-02 01:48:02 1 0.60 -73.988205 40.759205 1 N -73.982246 40.767685 2 4.5 0.5 0.5 0.00 0.0 0.3 5.80
9973 1 2016-01-02 01:45:27 2016-01-02 02:00:46 1 4.60 -73.989395 40.760468 1 N -73.920860 40.743256 2 16.0 0.5 0.5 0.00 0.0 0.3 17.30
9974 1 2016-01-02 01:45:27 2016-01-02 02:17:22 3 12.90 -74.004295 40.707962 1 N -73.844490 40.722347 1 38.5 0.5 0.5 0.00 0.0 0.3 39.80
9975 1 2016-01-02 01:45:28 2016-01-02 02:03:06 1 6.10 -74.000961 40.731586 1 N -73.941544 40.800468 1 19.0 0.5 0.5 2.22 0.0 0.3 22.52
9976 1 2016-01-02 01:45:28 2016-01-02 01:51:40 1 1.20 -74.010986 40.710609 1 N -74.010986 40.710609 2 6.5 0.5 0.5 0.00 0.0 0.3 7.80
9977 1 2016-01-02 01:45:28 2016-01-02 01:54:08 1 1.60 -73.973106 40.758457 1 N -73.996124 40.760876 2 7.5 0.5 0.5 0.00 0.0 0.3 8.80
9978 2 2016-01-02 01:45:28 2016-01-02 01:56:31 3 3.13 -74.002403 40.718761 1 N -73.977814 40.745529 2 11.5 0.5 0.5 0.00 0.0 0.3 12.80
9979 2 2016-01-02 01:45:28 2016-01-02 01:58:02 1 3.09 -73.961632 40.764370 1 N -73.919220 40.755932 2 11.5 0.5 0.5 0.00 0.0 0.3 12.80
9980 2 2016-01-02 01:45:28 2016-01-02 01:49:02 1 0.91 -73.994820 40.721390 1 N -73.985573 40.727058 1 5.0 0.5 0.5 1.26 0.0 0.3 7.56
9981 1 2016-01-02 01:45:29 2016-01-02 01:54:47 3 2.30 -74.003494 40.741982 1 N -73.981689 40.764687 2 9.0 0.5 0.5 0.00 0.0 0.3 10.30
9982 1 2016-01-02 01:45:29 2016-01-02 01:54:09 1 2.00 -73.999969 40.728603 1 N -73.978676 40.744957 1 9.0 0.5 0.5 2.05 0.0 0.3 12.35
9983 2 2016-01-02 01:45:29 2016-01-02 01:57:43 1 4.89 -73.993301 40.720043 1 N -73.952782 40.742481 2 16.5 0.5 0.5 0.00 0.0 0.3 17.80
9984 2 2016-01-02 01:45:30 2016-01-02 01:58:14 5 3.81 -73.972496 40.677151 1 N -73.926888 40.668835 1 13.5 0.5 0.5 2.96 0.0 0.3 17.76
9985 1 2016-01-02 01:45:31 2016-01-02 01:56:48 1 2.40 -73.910530 40.744858 1 N -73.914238 40.759933 2 10.5 0.5 0.5 0.00 0.0 0.3 11.80
9986 1 2016-01-02 01:45:31 2016-01-02 01:49:36 1 0.60 -74.000542 40.729885 1 N -74.004311 40.722778 2 4.5 0.5 0.5 0.00 0.0 0.3 5.80
9987 2 2016-01-02 01:45:31 2016-01-02 01:47:39 1 0.43 -73.994431 40.727772 1 N -74.000458 40.727341 2 3.5 0.5 0.5 0.00 0.0 0.3 4.80
9988 2 2016-01-02 01:45:31 2016-01-02 02:04:54 1 7.40 -73.953514 40.775261 1 N -73.881020 40.755943 1 23.5 0.5 0.5 4.96 0.0 0.3 29.76
9989 2 2016-01-02 01:45:31 2016-01-02 01:55:57 1 4.23 -73.981857 40.746017 1 N -73.943085 40.795063 2 13.5 0.5 0.5 0.00 0.0 0.3 14.80
9990 2 2016-01-02 01:45:31 2016-01-02 01:50:24 1 0.94 -73.983353 40.729210 1 N -73.983353 40.729210 1 5.5 0.5 0.5 1.36 0.0 0.3 8.16
9991 1 2016-01-02 01:45:32 2016-01-02 02:03:24 1 5.90 -73.954666 40.821003 1 N -73.954666 40.821003 1 18.5 0.5 0.5 5.94 0.0 0.3 25.74
9992 2 2016-01-02 01:45:32 2016-01-02 01:55:26 1 2.83 -73.985641 40.763119 1 N -74.001694 40.732391 1 10.5 0.5 0.5 2.36 0.0 0.3 14.16
9993 2 2016-01-02 01:45:32 2016-01-02 02:02:24 2 6.31 -73.972076 40.754040 1 N -73.869659 40.749451 2 19.5 0.5 0.5 0.00 0.0 0.3 20.80
9994 2 2016-01-02 01:45:32 2016-01-02 01:52:22 1 1.65 -73.992012 40.725880 1 N -74.009697 40.709923 1 7.0 0.5 0.5 1.00 0.0 0.3 9.30
9995 2 2016-01-02 01:45:33 2016-01-02 01:54:45 1 2.05 -73.989403 40.750538 1 N -74.003639 40.725395 2 9.0 0.5 0.5 0.00 0.0 0.3 10.30
9996 2 2016-01-02 01:45:34 2016-01-02 01:59:03 1 2.84 -73.974426 40.790932 1 N -73.940430 40.822159 1 12.5 0.5 0.5 3.45 0.0 0.3 17.25
9997 2 2016-01-02 01:45:34 2016-01-02 01:55:11 1 3.45 -73.989151 40.726864 1 N -73.958389 40.765392 1 11.5 0.5 0.5 1.00 0.0 0.3 13.80
9998 2 2016-01-02 01:45:35 2016-01-02 01:52:43 1 1.30 -73.968239 40.755379 1 N -73.956322 40.768002 1 7.0 0.5 0.5 1.70 0.0 0.3 10.00
9999 1 2016-01-02 01:45:37 2016-01-02 01:50:31 1 1.20 -73.982224 40.768620 1 N -73.983765 40.779598 1 6.0 0.5 0.5 2.00 0.0 0.3 9.30
10000 2 2016-01-02 01:45:37 2016-01-02 01:59:47 3 2.69 -73.960518 40.710976 1 N -73.925240 40.698357 2 12.0 0.5 0.5 0.00 0.0 0.3 13.30

10001 rows × 19 columns

In [54]:
 
#help(plotly.offline.iplot)
 
## Insight 1: Passenger numbers
 * Most NY Taxi trips transport solo passengers
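
The share of solo trips can be read directly from the `passenger_count` column. The snippet below is a minimal sketch on a small made-up sample (the values are illustrative, not taken from the dataset), and `passenger_share` is a hypothetical helper name.

```python
import pandas as pd


def passenger_share(trips: pd.DataFrame) -> pd.Series:
    """Return the fraction of trips for each passenger count, ordered by count."""
    return trips['passenger_count'].value_counts(normalize=True).sort_index()


# Made-up sample; in the notebook `df` would be passed instead.
sample = pd.DataFrame({'passenger_count': [1, 1, 1, 2, 1, 5, 2, 1]})
share = passenger_share(sample)
print(share)
```

The entry at index 1 is the fraction of single-passenger trips, which gives a number to put next to the histogram.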

In [2]:
 
import numpy as np
import plotly.plotly as py
#import plotly.offline as offline
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
import plotly.graph_objs as go
init_notebook_mode()
# Extract the number of passengers per trip as a 1-D array
peps_per_trip=df['passenger_count'].values
#print(type(peps_per_trip))
#layout=go.Layout(title="First Plot", xaxis={'title':'x1'}, yaxis={'title':'x2'})
data = [go.Histogram(x=peps_per_trip)]  # or [dataset1, dataset2]
layout = go.Layout(
    title='Histogram of Passenger numbers',
    xaxis=dict(
        title='passenger number'
    ),
    yaxis=dict(
        title='Count'
    ),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,  filename='People_per_trip_histogram') #this plots in online mode, limit of 50/day in community a/c
#iplot(fig,  filename='People_per_trip_histogram') #This plots when offline; no limit
High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~elmao/0 or inside your plot.ly account where it is named 'People_per_trip_histogram'
Out[2]:
 
## Insight 2: Cash versus Credit 
* New Yorkers prefer to pay with credit card (60:40 ratio in preference of credit card)
* Cash usage remains considerable at 40%. The cash option is a point of difference over competitor Uber.  
* Distribution of fares is similar across cash and credit card payments (the median credit card fare is \$1 higher than the median cash fare)
* The peak at \$52 represents Manhattan -> JFK Airport trips, which have a flat fare of \$52 (source: Wikipedia)
* NY taxi fares are cheap (compared to Melbourne!): the median fare is around \$9 to \$10
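
The payment split and the median fares can also be computed in a single grouped pass. This is a minimal sketch, not the notebook's own method: it relies on the payment codes listed in the notebook (1 = credit card, 2 = cash), runs on a made-up sample, and `payment_summary` is a hypothetical helper name.

```python
import pandas as pd

# Payment codes as listed in the notebook's comment: 1 = credit card, 2 = cash.
PAYMENT_LABELS = {1: 'Credit card', 2: 'Cash'}


def payment_summary(trips: pd.DataFrame) -> pd.DataFrame:
    """Share of trips (in %) and median fare for credit card and cash payments."""
    paid = trips[trips['payment_type'].isin(list(PAYMENT_LABELS))]
    grouped = paid.groupby('payment_type')['fare_amount']
    summary = pd.DataFrame({
        'share_pct': 100 * grouped.size() / len(paid),
        'median_fare': grouped.median(),
    })
    return summary.rename(index=PAYMENT_LABELS)


# Made-up sample (not real trip records); payment_type 3 is ignored.
sample = pd.DataFrame({
    'payment_type': [1, 1, 1, 2, 2, 3],
    'fare_amount': [9.0, 10.0, 52.0, 8.0, 9.0, 0.0],
})
summary = payment_summary(sample)
print(summary)
```

Counting rows per group avoids having to sum the payment codes and rescale them afterwards.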

In [10]:
 
# Distribution: Payment by type
#df=df_big  #uncomment to run on whole dataset
# Add histogram data
# extract fares by payment type
# 1=cc, 2=cash, 3=no charge, 4=dispute, 5=unknown, 6=voided trip
fare_paymenttype1=df.loc[df['payment_type'] == 1, 'fare_amount'].values #credit card
fare_paymenttype2=df.loc[df['payment_type'] == 2, 'fare_amount'].values #cash
#fare_paymenttype4=df.loc[df['payment_type'] == 4, 'fare_amount'].values #dispute
fare_payments=np.append(fare_paymenttype1,fare_paymenttype2)
total_paymentstype1=df.loc[df['payment_type'] == 1, 'total_amount'].values   # total charge (fare + tip + tolls + surcharges), credit card
total_paymentstype2=df.loc[df['payment_type'] == 2, 'total_amount'].values   # total charge (fare + tip + tolls + surcharges), cash
tip_amountstype1=df.loc[df['payment_type'] == 1, 'tip_amount'].values   # tips, credit card only
total_payments=np.append(total_paymentstype1,total_paymentstype2)
# Count trips per payment type directly rather than summing the type codes
numberofCCpays=(df['payment_type'] == 1).sum()
numberofCashpays=(df['payment_type'] == 2).sum()
PcentofCCpays=np.round(numberofCCpays*100/(numberofCashpays+numberofCCpays), decimals=1)
#print(PcentofCCpays)
PcentofCashpays=np.round(numberofCashpays*100/(numberofCashpays+numberofCCpays), decimals=1)
#print(PcentofCashpays)
#print(type(fare_paymenttype2[1:10]))
# Group data together
hist_data = [fare_paymenttype1,fare_paymenttype2]
find_median1=np.median(fare_paymenttype1)
find_median2=np.median(fare_paymenttype2)
#print(find_median)
group_labels = ['Credit card', 'Cash']
# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=1.0)
fig.layout.update({'title': 'Distribution of Fares'})
fig.layout.xaxis1.update({'title': '$ amounts'})
# Plot!
#py.iplot(fig, filename='Distplot with Multiple Datasets') #online plot mode
iplot(fig, filename='Distplot with Multiple Datasets') #offline mode
display(Math(r'\text{Percentage of credit card payments is } %s \text{%%}' % PcentofCCpays))
display(Math(r'\text{Median credit payment is \$} %s ' % find_median1))
display(Math(r'\text{Percentage of cash payments is  } %s \text{%%}' % PcentofCashpays))
display(Math(r'\text{Median cash payment is \$} %s' % find_median2))
[Figure: Distribution of Fares (density of fare amounts for Credit card and Cash; x-axis: $ amounts)]
Percentage of credit card payments is 60.8%
Median credit payment is $9.5
Percentage of cash payments is 39.2%
Median cash payment is $8.5
 
## Insight 3: Fare Breakdown
* Median tip (credit card data only, since cash tips are not recorded) is about 20% of the median credit card fare (\$1.96 against \$9.50)
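
A tip percentage can be summarised in two ways: as the ratio of the median tip to the median fare, or as the median of the per-trip tip/fare ratios. The two need not agree. The sketch below contrasts them on a made-up sample; `tip_percentages` is a hypothetical helper name, and zero fares are filtered out to avoid division by zero.

```python
import pandas as pd


def tip_percentages(trips: pd.DataFrame) -> dict:
    """Two summaries of tipping on credit card trips (payment_type == 1)."""
    card = trips[(trips['payment_type'] == 1) & (trips['fare_amount'] > 0)]
    ratio_of_medians = 100 * card['tip_amount'].median() / card['fare_amount'].median()
    median_of_ratios = 100 * (card['tip_amount'] / card['fare_amount']).median()
    return {'ratio_of_medians': ratio_of_medians,
            'median_of_ratios': median_of_ratios}


# Made-up sample (not real trip records).
sample = pd.DataFrame({
    'payment_type': [1, 1, 1, 2],
    'fare_amount': [5.0, 10.0, 40.0, 8.0],
    'tip_amount': [1.5, 2.0, 12.0, 0.0],
})
result = tip_percentages(sample)
print(result)
```

On this sample the ratio of medians is 20% while the median per-trip ratio is 30%, so it is worth stating which of the two a reported figure refers to.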

In [27]:
 
# Group data together
hist_data2 = [fare_payments,total_payments,tip_amountstype1]
group_labels2 = ['Fare', 'Total Charge', 'Tip Amount']
# Create distplot with custom bin_size
fig2 = ff.create_distplot(hist_data2, group_labels2, bin_size=[0.5,0.5,0.4])
fig2.layout.update({'title': 'Breakdown & Distribution of NY Taxi Fares'})
fig2.layout.xaxis1.update({'title': '$ amounts'})
# Plot!
#py.iplot(fig2, filename='Distplot with Multiple Datasets2') # online plot option
iplot(fig2, filename='Distplot with Multiple Datasets2') # offline plot option
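find_mediantip=np.median(tip_amountstype1)
# Ratio of the median tip to the median credit card fare (find_median1), not the median of per-trip tip ratios
Med_tip_percentage=np.round(find_mediantip*100/find_median1, decimals=1)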
display(Math(r'\text{Median tip payment (Credit card payment data only) is \$} %s ' % find_mediantip))
display(Math(r'\text{Median tip percentage (Credit card payment data only) is } %s \text{%%}' % Med_tip_percentage))
[Figure: Breakdown & Distribution of NY Taxi Fares (densities of Fare, Total Charge and Tip Amount; x-axis: $ amounts)]
Median tip payment (Credit card payment data only) is $1.96
Median tip percentage (Credit card payment data only) is 20.6%
 
## Insight 4: Pick-up and Drop-off Locations 
* Manhattan (central business zone) is the busiest area for taxi use
* Airports (LaGuardia and JFK) feature strongly in usage maps
    * Curiously, people are dropped off at the airports at very fixed locations, while pick-up locations are more diffuse
        * Possible explanations: people wander out from the airport and hail taxis from wherever they are; taxi ranks are not easy to locate; a GPS issue; or meters are started on the move
        
        
* People **start taxi journeys** most frequently:
    1. in Manhattan on the **main streets**
    2. on the **main arterial routes** within residential areas (Brooklyn, Queens)
        * The *Sex And The City* imagery of hailing taxis on demand from busy streets is backed up by the data, which is interesting in times of Uber.
    
    
* People **end taxi journeys** most frequently:
    1. again in Manhattan, both on main streets and off the main streets 
    2. at very **diffuse locations** across residential areas (Brooklyn, Queens, The Bronx)
        * The Bronx is a frequent drop-off location, but rarely a pick-up location 
            * An effect of green "boro taxis" since 2013? (Note, however, that boroughs where green taxis can be hailed include The Bronx, Queens and Brooklyn: yet the Bronx taxi pattern is notably different)
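
One way to put a number on "diffuse" is to count how many grid cells of the map window contain at least one point. This is an assumed measure added for illustration, not something the notebook computes: the window matches the `xlim`/`ylim` of the scatter plots, the coordinates are made up, and `occupied_cells` is a hypothetical helper name.

```python
import numpy as np

# Map window used by the notebook's scatter plots.
LON_RANGE = (-74.06, -73.77)
LAT_RANGE = (40.61, 40.91)


def occupied_cells(lon, lat, bins=100):
    """Count grid cells in the map window that contain at least one point."""
    counts, _, _ = np.histogram2d(lon, lat, bins=bins,
                                  range=[LON_RANGE, LAT_RANGE])
    return int((counts > 0).sum())


# Made-up coordinates: pick-ups cluster in one spot, drop-offs are spread out.
pickup_lon = [-73.99, -73.99, -73.98, -73.99]
pickup_lat = [40.75, 40.75, 40.75, 40.755]
dropoff_lon = [-73.99, -73.95, -73.90, -73.81]
dropoff_lat = [40.75, 40.80, 40.71, 40.65]

pickup_cells = occupied_cells(pickup_lon, pickup_lat, bins=10)
dropoff_cells = occupied_cells(dropoff_lon, dropoff_lat, bins=10)
print(pickup_cells, dropoff_cells)
```

A larger count for drop-offs than for pick-ups over the same window would support the "diffuse drop-off" observation; the same comparison could be restricted to an airport or Bronx bounding box.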

In [53]:
 
# Map the pick up locations
import pandas as pd
import matplotlib  
import matplotlib.pyplot as plt 
from matplotlib import rcParams  
df=df_big
#pd.options.display.mpl_style = 'default' #Better Styling 
matplotlib.pyplot.style.use('ggplot')
new_style = {'grid': False} #Grid off  
matplotlib.rc('axes', **new_style)  
rcParams['figure.figsize'] = (12, 12) #Size of figure  
rcParams['figure.dpi'] = 250
P=df.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude',color='white',xlim=(-74.06,-73.77),ylim=(40.61, 40.91),s=.02,alpha=.3)
#P.set_axis_bgcolor('black') #Background Color
P.set_facecolor('black') #Background Colour
#plt.show()
In [55]:
 
# Map the drop off locations
df=df_big
import matplotlib  
import matplotlib.pyplot as plt 
from matplotlib import rcParams 
##Inline Plotting for jupyter Notebook 
#%matplotlib inline 
#pd.options.display.mpl_style = 'default' #Better Styling  
matplotlib.pyplot.style.use('ggplot')
new_style = {'grid': False} #Grid off  
matplotlib.rc('axes', **new_style)  
 
rcParams['figure.figsize'] = (12, 12) #Size of figure  
rcParams['figure.dpi'] = 250
P=df.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude',color='white',xlim=(-74.06,-73.77),ylim=(40.61, 40.91),s=.02,alpha=.3)  # s is the marker size and alpha the opacity
P.set_facecolor('black') #Background Colour
plt.show()
 
## Insight 5: Average fare by day and time
* Average fare is similar over weekdays
* Early birds catch the worm: Taxi drivers operating from 3:00 am to 7:00 am earn above average fares 
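
The hourly and weekday averages can be obtained with a grouped mean instead of explicit loops. The sketch below is an illustration on made-up rows, with `mean_fare_by_hour_and_day` as a hypothetical helper name. It groups by `dt.day_name()` so that no manual weekday labels are needed; note that pandas numbers weekdays from Monday = 0 to Sunday = 6.

```python
import pandas as pd


def mean_fare_by_hour_and_day(trips: pd.DataFrame):
    """Return (mean fare per pickup hour, mean fare per weekday name)."""
    pickup = pd.to_datetime(trips['tpep_pickup_datetime'],
                            format='%Y-%m-%d %H:%M:%S')
    by_hour = trips['fare_amount'].groupby(pickup.dt.hour).mean()
    by_day = trips['fare_amount'].groupby(pickup.dt.day_name()).mean()
    return by_hour, by_day


# Made-up rows; 2016-01-01 was a Friday.
sample = pd.DataFrame({
    'tpep_pickup_datetime': ['2016-01-01 04:10:00', '2016-01-01 04:50:00',
                             '2016-01-02 13:00:00', '2016-01-03 13:30:00'],
    'fare_amount': [20.0, 16.0, 9.0, 11.0],
})
by_hour, by_day = mean_fare_by_hour_and_day(sample)
print(by_hour)
print(by_day)
```

Because the grouped mean only returns hours and days that occur in the data, it also avoids averaging over an empty selection.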


In [55]:
 
# Times of the day versus average fare.
#df1=[]
df=df_big  #renaming for test stage
print(df.shape)
# Make new column in dataframe with hour of day and day of the week
df['hour'] = pd.to_datetime(df['tpep_pickup_datetime'], format='%Y-%m-%d %H:%M:%S').dt.hour
df['day'] = pd.to_datetime(df['tpep_pickup_datetime'], format='%Y-%m-%d %H:%M:%S').dt.dayofweek
#find mean fare by hour of day
meanfare_byhour=[] #initialise
for i in range(0,24):
    fares_byhour=df.loc[df['hour'] == i, 'fare_amount'].values #hourly fares
    meanfare_byhour.append(np.mean(fares_byhour))
    #print(i)
    #print(meanfare_byhour)
#pandas dt.dayofweek convention is 0:'Mon', 1:'Tue', 2:'Wed', 3:'Thu', 4:'Fri', 5:'Sat', 6:'Sun'
#find mean fare by weekday
meanfare_byweekday=[] #initialise
#print(meanfare_byweekday)
for i in range(0,7):
    fare_byweekday=df.loc[df['day'] == i, 'fare_amount'].values #weekday fares
    meanfare_byweekday.append(np.mean(fare_byweekday))
    #print(i)
    #print(meanfare_byweekday)
#print(meanfare_byhour)
#unweighted mean of the 24 hourly means (not weighted by trip count)
meanacrosshoursofday=np.mean(meanfare_byhour)
#plot bar chart of mean fare by weekday (labels follow the 0=Mon ... 6=Sun order above)
data = [go.Bar(
            x=['Mon', 'Tue','Wed','Thu','Fri','Sat','Sun'],
            y=meanfare_byweekday
    )]
layout = go.Layout(
    xaxis=dict(tickangle=-45),
    barmode='group',
    title='Mean Fare by Weekday',
    yaxis=dict(
        title='$'
    ),
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='basic-barWeekday')   
#plot bar chart of mean fare by hour of day
traceBar1 = go.Bar(
            x=['%d:00' % h for h in range(24)],  # one label per hourly mean (0:00 to 23:00)
            y=meanfare_byhour,
            name = 'hourly mean fare'
    )
trace2 = go.Scatter(
            x=['0:00','24:00'],
            y=[meanacrosshoursofday,meanacrosshoursofday],
            mode='lines',
            name = 'overall mean'
    )
layout2 = go.Layout(
    xaxis=dict(tickangle=-45),
    barmode='group',
    title='Mean Fares by Hour',
    yaxis=dict(
        title='$'
    ),
)
#syntax note for two traces bar and line in one plot:
#trace1=go.bar( ... )
#trace2=go.bar( ... )
#data2=[trace1,trace2]
#fig2 = go.Figure(data=data2,...
#or include square brackets
#data2=[go.bar( ... )]
#fig2 = go.Figure(data=data2,...
data2 = [traceBar1, trace2]
#print([meanacrosshoursofday,meanacrosshoursofday])
fig2 = go.Figure(data=data2, layout=layout2)
iplot(fig2, filename='basic-barHour')    
    
(2389990, 21)
[Figure: Mean Fare by Weekday (bar chart; y-axis: $)]
[Figure: Mean Fares by Hour (bars: hourly mean fare; line: overall mean; y-axis: $)]
 
## Insight 6: Busiest City Locations
* Nine of the ten busiest pick-up locations are in Manhattan; the tenth is JFK Airport
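
The ranking behind this insight comes from binning pick-up coordinates onto a grid and counting trips per cell. The sketch below illustrates that idea under stated assumptions: it uses a 0.02-degree cell and labels each cell by its centre via `floor`, which is a swapped-in variant rather than the notebook's exact rounding; the coordinates are made up and `busiest_cells` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

CELL = 0.02  # grid cell size in degrees (assumed)


def busiest_cells(trips: pd.DataFrame, top_n: int = 10) -> pd.Series:
    """Count pick-ups per grid cell; return the top_n cells keyed by cell centre."""
    lat_centre = (np.floor(trips['pickup_latitude'] / CELL) + 0.5) * CELL
    lon_centre = (np.floor(trips['pickup_longitude'] / CELL) + 0.5) * CELL
    counts = trips.groupby([lat_centre.round(3).rename('cell_lat'),
                            lon_centre.round(3).rename('cell_lon')]).size()
    return counts.sort_values(ascending=False).head(top_n)


# Made-up pick-ups: three in one Midtown cell, one near JFK.
sample = pd.DataFrame({
    'pickup_latitude': [40.751, 40.759, 40.755, 40.645],
    'pickup_longitude': [-73.985, -73.990, -73.981, -73.785],
})
top = busiest_cells(sample)
print(top)
```

Dividing each count by the total number of trips gives the percentage shares used for a pie chart of the busiest locations.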

In [66]:
#Top 10  busiest locations of the city
import reverse_geocoder as rg
from geopy.geocoders import Nominatim
import gmplot
Topnum=10  #Find top number (Topnum) busiest locations in city
df=df_big
#round the lat and long entries 
#Latitude_round=df.loc[df['payment_type'] == 1, 'fare_amount'].values
Latitude_round  = (np.round(df['pickup_latitude'].values/2, decimals=2))*2+0.005   #round and recentre grid box
Longitude_round = (np.round(df['pickup_longitude'].values/2, decimals=2))*2+0.005 #round and recentre grid box
#print(Latitude_round[0:5])
#print(Longitude_round[0:5])
df.loc[:,'GridcodeLat'] = pd.Series(Latitude_round, index=df.index) #add column gridcodes to df
df.loc[:,'GridcodeLon'] = pd.Series(Longitude_round, index=df.index) #add column gridcodes to df
#find 10 locations with most common grid codes
mytable = df.groupby(['GridcodeLat','GridcodeLon']).size()
mytable.sort_values(inplace=True,ascending=False)
totaltrips=mytable.sum()
print('Total trips')
print(totaltrips)
Top10BusyPickupLocations=mytable.head(Topnum)
#print(Top10BusyPickupLocations)
#print(type(Top10BusyPickupLocations))
Top10BusyPickupLocations=Top10BusyPickupLocations.to_frame()
 #find values for later pie chart of top 10 busiest locations by percentage trip pick ups
num_trips=np.array(Top10BusyPickupLocations)
num_trip_perc=num_trips*100/totaltrips
othertrips=100-sum(num_trip_perc)
num_trip_perc=np.append(num_trip_perc,othertrips)
#print(Top10BusyPickupLocations)
#print(type(Top10BusyPickupLocations))
coordinates = Top10BusyPickupLocations.index.values.tolist()
marker_lats = np.array(coordinates)[:,0]
marker_lngs = np.array(coordinates)[:,1]
#radaii=np.arange(30,10,-(30-10)/Topnum)
gmap = gmplot.GoogleMapPlotter(40.75, -73.9, 11) #manual map location boundaries: center_lat, center_lng, zoom
gmap.plot([40.85], [-73.95], 'cornflowerblue', edge_width=10)
gmap.heatmap(marker_lats, marker_lngs, threshold=5, radius=10, gradient=None, opacity=0.6, dissipating=True)
gmap.draw("mymap.html")
Total trips
2389990
[(40.765, -73.97500000000001), (40.745000000000005, -73.995), (40.745000000000005, -73.97500000000001), (40.785000000000004, -73.955), (40.765, -73.955), (40.725, -73.995), (40.765, -73.995), (40.785000000000004, -73.97500000000001), (40.725, -73.97500000000001), (40.645, -73.775)]
In [4]:
 
%%html
<!-- Opening the map inside Jupyter fails because Google Maps needs an API key (to be fixed). Meanwhile, open the mymap.html file from the directory. -->
<iframe src="mymap.html" width="1000"></iframe>
In [70]:
 
#plot pie chart of Top 10 busiest locations
NYToplabels=['Midtown, Manhattan', 
             'Penn Station, Manhattan',
             'Grand Central Station, Manhattan',
             'Upper East Side, Manhattan',
             'Lenox Hill, Manhattan',
             'Lower Manhattan',
             "Hell's Kitchen, Manhattan",
             'Upper West Side, Manhattan',
             'East Village, Manhattan',
             'John F. Kennedy International Airport',
             'All other areas']            
            
# Add graph data
trace1={'labels': NYToplabels,
        'values': np.append(num_trips,totaltrips-sum(num_trips)),
        'type': 'pie',
        'name': 'Pick up',
            'domain': {'x': [0, 1],
                       'y': [.4, 1]},
            'hoverinfo':'label+percent+name',
            'textinfo':'none'
        }
data = [trace1]
layout = go.Layout(
    #xaxis=dict(tickangle=-45),
    #barmode='group',
    title='Top Taxi Pick-up Locations',
    #yaxis=dict(
    #    title='$'
    #),
)
fig = go.Figure(data=data, layout=layout)
# Plot!
iplot(fig)
[Figure: Top Taxi Pick-up Locations (pie chart of the ten busiest pick-up locations plus "All other areas")]
In [150]:
 
#help(gmplot.GoogleMapPlotter)
#help(HTML)
In [80]:
 
# Find the addresses of the coordinates. Two ways of doing this were found, but addresses are awkward to handle because their formats are inconsistent.
# Let's use Google Maps instead (implemented in the cells above)
results = rg.search(coordinates) # default mode = 2, reverse geocode from lat and long to address
print(results)
geolocator = Nominatim()
#locations = geolocator.reverse("40.755,     -73.985")
for i in range(0,Topnum):
        location = geolocator.reverse(coordinates[i])
        PlaceNames=location.address.split(",")
        print([PlaceNames[-8],PlaceNames[-7],PlaceNames[-6]] )
    
#df1.loc[:,'f'] = p.Series(np.random.randn(sLength), index=df1.index) #add column f to df1
#plot table or pie chart
[OrderedDict([('lat', '40.78343'), ('lon', '-73.96625'), ('name', 'Manhattan'), ('admin1', 'New York'), ('admin2', 'New York County'), ('cc', 'US')]), OrderedDict([('lat', '40.71427'), ('lon', '-74.00597'), ('name', 'New York City'), ('admin1', 'New York'), ('admin2', ''), ('cc', 'US')]), OrderedDict([('lat', '40.74482'), ('lon', '-73.94875'), ('name', 'Long Island City'), ('admin1', 'New York'), ('admin2', 'Queens County'), ('cc', 'US')]), OrderedDict([('lat', '40.78343'), ('lon', '-73.96625'), ('name', 'Manhattan'), ('admin1', 'New York'), ('admin2', 'New York County'), ('cc', 'US')]), OrderedDict([('lat', '40.74482'), ('lon', '-73.94875'), ('name', 'Long Island City'), ('admin1', 'New York'), ('admin2', 'Queens County'), ('cc', 'US')]), OrderedDict([('lat', '40.71427'), ('lon', '-74.00597'), ('name', 'New York City'), ('admin1', 'New York'), ('admin2', ''), ('cc', 'US')]), OrderedDict([('lat', '40.76955'), ('lon', '-74.02042'), ('name', 'Weehawken'), ('admin1', 'New Jersey'), ('admin2', 'Hudson County'), ('cc', 'US')]), OrderedDict([('lat', '40.78343'), ('lon', '-73.96625'), ('name', 'Manhattan'), ('admin1', 'New York'), ('admin2', 'New York County'), ('cc', 'US')]), OrderedDict([('lat', '40.71427'), ('lon', '-74.00597'), ('name', 'New York City'), ('admin1', 'New York'), ('admin2', ''), ('cc', 'US')]), OrderedDict([('lat', '40.62205'), ('lon', '-73.7468'), ('name', 'Inwood'), ('admin1', 'New York'), ('admin2', 'Nassau County'), ('cc', 'US')])]
[' Central Park South', ' Diamond District', ' Manhattan']
[' Chelsea', ' Manhattan', ' Manhattan Community Board 4']
[' Murray Hill', ' Manhattan', ' Manhattan Community Board 6']
[' Yorkville', ' Manhattan', ' Manhattan Community Board 8']
[' Lenox Hill', ' Manhattan', ' Manhattan Community Board 8']
[' Five Points', ' Manhattan', ' Manhattan Community Board 2']
[" Hell's Kitchen", ' Manhattan', ' Manhattan Community Board 4']
[' Upper West Side', ' Manhattan', ' Manhattan Community Board 7']
[' Alphabet City', ' Manhattan', ' Manhattan Community Board 3']
['7', ' Terminal 5 Departures', ' Bayswater']